Adversarial Attacks

LaVAN - An Adversarial Patch Attack

Published: 28.08.2024
PythonTensorflow

An adversarial patch is a digital or physical object which is placed into the input image of a convolutional neural network with the goal of triggering a missclassification. In this article we will take a look at LaVAN - an adversarial patch attack proposed by Karmon et al. [1]. I will explain how the attack works and present my implementation of the attack using Tensorflow 2.

How does LaVAN work?

The goal of the attack is to maximize the probability of the chosen target class $y_{\text{target}}$ and to minimize the probability of the true/correct class $y_{\text{source}}$ which leaves us with the following optimization problem:

$$ \argmax_{p} \, [M(y = y_{\text{target}}\,|\,x') - M(y = y_{\text{source}}\,|\,x')] $$

Here $ M(y = y'\,|\,x') $ denotes the activation (or score) of class $y'$ before the softmax layer and $x'$ is the image $x$ we want to attack with the patch $p$ inserted into it. According to the authors using the activations before softmax speeds up convergence.

To generate the patch, we first have to choose an input image $x$ that we want to attack, a position and the width and height (or alternatively a mask) for the patch $p$. After initializing the patch (e.g. with zeros) it is generated iteratively. Each iteration the gradients of the scores of our target class $y_{\text{target}}$ and the source class $y_{\text{source}}$ with respect to the attacked image $x'$ have to be calculated (which can be done with a single backward pass of our model). Let's call these gradients $\nabla_{\text{target}}$ and $\nabla_{\text{source}}$: $$ \begin{align*} L_{\text{target}} &= M(y = y_{\text{target}}\,|\,x')\\ L_{\text{source}} &= M(y = y_{\text{source}}\,|\,x')\\ \nabla_{\text{target}} &= \frac{\partial L_{\text{target}}}{\partial x}\\ \nabla_{\text{source}} &= \frac{\partial L_{\text{source}}}{\partial x} \end{align*} $$ The gradients $\nabla_{\text{target}}$ and $\nabla_{\text{source}}$ provide one value for each value in our input image. If the gradient is positive, it means that the score of the class the gradient was generated for will increase if we increase the corresponding value in the input image. If the gradient is negative, increasing the corresponding value in the input image will decrease the class score. With this in mind, we have to update the patch $p$ like so: $$p = p - \epsilon \cdot (\nabla_{\text{source}} - \nabla_{\text{target}})$$ After that we can clip the updated patch to the image domain (e.g.: [0-255]) if desired and finally we have to update the image $x$ with the updated patch $p$.

Now we can do the above steps for a given amount of iterations or we can stop when the generated patch causes the score of the target class to be greater than e.g. 80%.

Implementation

The Python code for Tensorflow 2 looks like this:

import tensorflow as tf from keras import Model import numpy as np @tf.function def gradient_tape(input_img, logit_model, target_index, source_index): input_img = tf.convert_to_tensor(input_img, dtype=tf.float32) # you might have to use dtype=tf.float64 here ^ with tf.GradientTape(persistent=True) as tape: tape.watch(input_img) preds = logit_model(input_img) loss_target = preds[:, target_index] loss_source = preds[:, source_index] # get the gradients of the loss w.r.t. to the input image gradient_target = tape.gradient(loss_target, input_img) gradient_source = tape.gradient(loss_source, input_img) del tape return gradient_target, gradient_source def get_gradient(input_img, logit_model: Model, target_index, source_index): gradient_target, gradient_source = gradient_tape(input_img, logit_model, target_index, source_index) # convert to numpy array gradient_target = gradient_target.numpy()[0] gradient_source = gradient_source.numpy()[0] return gradient_target, gradient_source def laVAN(model, img, x, y, width, height, epsilon, iterations, target_class, subtract_mean=None, clip_range=None): """ Parameters: img - the preprocessed input image to be attacked x, y, width, height - the position and size of the patch epsilon - this value scales the gradients iterations - number of iterations target_class - the target class of the attack subtract_mean - a mean value that is subtracted from the input images of the model for preprocessing, e.g.: [50.0, 13.0, 123.0] clip_range - range for clipping the patch, e.g.: [0.0, 255.0] """ patch = np.zeros((height, width, 3)) if(subtract_mean is not None): patch[:, :] -= subtract_mean source_class = model.predict(img, verbose=0)[0].argmax() # using the second last layer to get the layer before softmax: logit_model = Model(inputs=model.inputs, outputs=model.layers[-2].output) for i in range(iterations): gradient_target, gradient_source = get_gradient(img, logit_model, target_class, source_class) gradient_target = gradient_target[y:(y+height), x:(x+width)] gradient_source = gradient_source[y:(y+height), x:(x+width)] if(clip_range is not None): patch[..., 0] = np.clip(patch[..., 0], -subtract_mean[0]-clip_range[0], -subtract_mean[0]+clip_range[1]) patch[..., 1] = np.clip(patch[..., 1], -subtract_mean[1]-clip_range[0], -subtract_mean[1]+clip_range[1]) patch[..., 2] = np.clip(patch[..., 2], -subtract_mean[2]-clip_range[0], -subtract_mean[2]+clip_range[1]) patch = patch - (epsilon * (gradient_source - gradient_target)) img[0, y:(y+height), x:(x+width)] = patch if(i % 100 == 0): preds = model.predict(img, verbose = 0) print(preds.argmax(), preds[0, preds.argmax()]) Here is an example with a pretrained model: import tensorflow as tf import numpy as np model = tf.keras.applications.ResNet50() img = tf.keras.utils.load_img("[img-path]", target_size=(224, 224)) img = tf.keras.utils.img_to_array(img) img = np.expand_dims(img, axis=0) img = tf.keras.applications.resnet50.preprocess_input(img) mean = [103.939, 116.779, 123.68] # from: https://github.com/keras-team/keras/blob/master/keras/src/applications/imagenet_utils.py#L161 laVAN(model, img, 27, 85, 48, 48, 5000, 1500, 603, mean, [0.0, 255.0]) preds = model.predict(img, verbose = 0) print(preds.argmax(), preds[0, preds.argmax()])

I found that for epsilon values around 1000-5000 work well and for the iteration count values up to 10000 seem reasonable.

The code was tested with the following package versions on Windows:

tensorflow-gpu 2.10.0 python 3.9.19 cudatoolkit 11.8.0 cudnn 8.9.7.29

There also exist implementations of this attack for Tensorflow 1 and PyTorch.


Sources

[1]
D. Karmon, D. Zoran, and Y. Goldberg. LaVAN: Localized and visible adversarial noise. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2507-2515. PMLR, 10-15 Jul 2018. http://proceedings.mlr.press/v80/karmon18a/karmon18a.pdf.